6 research outputs found
Astaroth: A software library for stencil computations on graphics processing units
Graphics processing units (GPUs) are coprocessors that offer higher throughput and better power efficiency than central processing units (CPUs) in data-parallel tasks. For this reason, GPUs provide a good platform for high-performance computing. However, programming GPUs so that all the available performance is utilized requires in-depth knowledge of the hardware architecture. Additionally, the problem of high-order stencil computations on GPUs in challenging multiphysics applications has not been adequately explored in previous work. In this thesis, we address these issues by presenting a library, an efficient algorithm, and a domain-specific language for solving stencil computations within a structured grid. We tested our implementation by simulating magnetohydrodynamics, which involved the computation of first, second, and cross partial derivatives using second-, fourth-, sixth-, and eighth-order finite differences with single and double precision. The running time of our integration kernel was 2.8–9.1 times slower than the theoretical minimum time it would take to read the computational domain and write it back to device memory exactly once, without taking into account the effects of finite caches or arithmetic operations on performance. Additionally, we made a performance comparison with a CPU solver widely used for scientific computations, which we benchmarked on a total of 24 cores of two Intel Xeon E5-2690 v3 processors. Our solver, benchmarked on a Tesla P100 PCIe GPU, outperformed the CPU solver by factors of 6.7 and 10.4 when using single and double precision, respectively.
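To make the high-order finite differences above concrete, here is a minimal sketch of a sixth-order central-difference first derivative on a uniform 1D grid. The stencil coefficients are the standard ones; the grid setup and function names are illustrative assumptions, not Astaroth's API.

```python
import math

# Sixth-order central-difference coefficients for the first derivative.
# The coefficient for the center point is zero and is omitted.
C = (-1.0 / 60.0, 3.0 / 20.0, -3.0 / 4.0, 3.0 / 4.0, -3.0 / 20.0, 1.0 / 60.0)
OFFSETS = (-3, -2, -1, 1, 2, 3)

def ddx_6th(f, i, dx):
    """Approximate the first derivative of the sampled function f at index i."""
    return sum(c * f[i + o] for c, o in zip(C, OFFSETS)) / dx

# Sample sin(x) on a uniform grid and differentiate at x = pi,
# where the exact derivative is cos(pi) = -1.
n = 64
dx = 2.0 * math.pi / n
f = [math.sin(k * dx) for k in range(n + 4)]  # extra points so i + 3 stays in range
i = n // 2
approx = ddx_6th(f, i, dx)
exact = math.cos(i * dx)
```

The truncation error scales as dx^6, so even this coarse grid agrees with the analytic derivative to better than one part in a million.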
Scalable communication for high-order stencil computations using CUDA-aware MPI
Modern compute nodes in high-performance computing provide a tremendous level
of parallelism and processing power. However, as arithmetic performance has
been observed to increase at a faster rate relative to memory and network
bandwidths, optimizing data movement has become critical for achieving strong
scaling in many communication-heavy applications. This performance gap has been
further accentuated with the introduction of graphics processing units, which
can provide several times higher throughput in data-parallel tasks than
central processing units. In this work, we explore the computational aspects of
iterative stencil loops and implement a generic communication scheme using
CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations
based on high-order finite differences and third-order Runge-Kutta integration.
We put particular focus on improving intra-node locality of workloads. In
comparison to a theoretical performance model, our implementation exhibits
strong scaling from one to devices at -- efficiency in
sixth-order stencil computations when the problem domain consists of
-- cells.
Comment: 17 pages, 15 figures
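The communication scheme above boils down to a halo (ghost-zone) exchange: each rank sends its boundary slabs to its neighbours and receives theirs into its ghost zones. The sketch below shows that bookkeeping for a periodic 1D decomposition, with ranks simulated by plain lists so it runs without MPI; it is a hypothetical illustration, not the paper's actual implementation. With CUDA-aware MPI, the same slabs would be device buffers handed directly to MPI_Isend/MPI_Irecv.

```python
R = 3  # ghost-zone width; a sixth-order stencil needs a radius of 3 cells

def exchange_halos(ranks):
    """Periodic 1D halo exchange over a list of per-rank arrays.

    Each entry has length R + n_local + R; indices [R:-R] are owned
    cells, the first and last R entries are ghost zones to be filled."""
    p = len(ranks)
    for r in range(p):
        left, right = ranks[(r - 1) % p], ranks[(r + 1) % p]
        # Left ghosts receive the left neighbour's rightmost owned cells,
        # right ghosts the right neighbour's leftmost owned cells.
        # Only ghost zones are written, so owned cells stay valid throughout.
        ranks[r][:R] = left[-2 * R:-R]
        ranks[r][-R:] = right[R:2 * R]

# Four ranks, eight owned cells each, global cell ids 0..31.
n_local, p = 8, 4
ranks = [[None] * R + list(range(r * n_local, (r + 1) * n_local)) + [None] * R
         for r in range(p)]
exchange_halos(ranks)
```

After the exchange, rank 0's left ghost zone holds the globally last cells 29–31, reflecting the periodic domain.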
Interaction of large- and small-scale dynamos in isotropic turbulent flows from GPU-accelerated simulations
Magnetohydrodynamical (MHD) dynamos emerge in many different astrophysical
situations where turbulence is present, but the interaction between large-scale
(LSD) and small-scale dynamos (SSD) is not fully understood. We performed a
systematic study of turbulent dynamos driven by isotropic forcing in isothermal
MHD with magnetic Prandtl number of unity, focusing on the exponential growth
stage. Both helical and non-helical forcing were employed to separate the
effects of LSD and SSD in a periodic domain. Reynolds numbers (Rm) up to
were examined and multiple resolutions used for convergence
checks. We ran our simulations with the Astaroth code, designed to accelerate
3D stencil computations on graphics processing units (GPUs) and to employ
multiple GPUs with peer-to-peer communication. We observed a speedup of
in single-node performance compared to the widely used multi-CPU
MHD solver Pencil Code. We estimated the growth rates both from the averaged
magnetic fields and their power spectra. At low Rm, LSD growth dominates, but
at high Rm SSD appears to dominate in both helically and non-helically forced
cases. Pure SSD growth rates follow a logarithmic scaling as a function of Rm.
Probability density functions of the magnetic field from the growth stage
exhibit SSD behaviour in helically forced cases even at intermediate Rm. We
estimated mean-field turbulence transport coefficients using closures like the
second-order correlation approximation (SOCA). They yield growth rates similar
to the directly measured ones and provide evidence of quenching. Our
results are consistent with the SSD inhibiting the growth of the LSD at
moderate Rm, while the dynamo growth is enhanced at higher Rm.Comment: 22 pages, 23 figures, 2 tables, Accepted for publication in the
Astrophysical Journa
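Estimating an exponential growth rate from the averaged magnetic field, as described above, amounts to fitting a line to ln(B_rms) versus time over the kinematic stage. The sketch below does this with an ordinary least-squares slope in pure Python; the synthetic time series is an assumption for the demo, not data from the paper.

```python
import math
import random

def growth_rate(times, b_rms):
    """Least-squares slope of ln(b) vs t, i.e. gamma in b ~ exp(gamma * t)."""
    logs = [math.log(b) for b in b_rms]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    num = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den

# Synthetic field growing at gamma = 0.5 with mild multiplicative noise.
random.seed(1)
ts = [0.1 * k for k in range(100)]
bs = [1e-6 * math.exp(0.5 * t) * (1.0 + 0.01 * random.uniform(-1.0, 1.0))
      for t in ts]
gamma = growth_rate(ts, bs)
```

Because the noise is small and multiplicative, it enters the logarithm almost additively, and the fitted slope recovers the true rate to within a fraction of a percent.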
The Pencil Code, a modular MPI code for partial differential equations and particles: multipurpose and multiuser-maintained
openaire: EC/H2020/227952/EU//ASTRODYN | openaire: EC/H2020/818665/EU//UniSDyn
Peer reviewed